Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add @Utf16String annotation to map java.lang.String to unsigned short* #442

Merged
merged 2 commits into from
Feb 12, 2021

Conversation

equeim
Copy link
Contributor

@equeim equeim commented Dec 16, 2020

Currently, if your native string type is UTF-16 encoded, and you want to map it to java.lang.String, you need to expose it as CharPointer and call its getString() method. It works, but it involves creation of intermediary Java object (CharPointer), C++ -> Java method call to inititalize CharPointer, and additional array copy.

This patch allows to do it much faster by directly using NewString() JNI function to directly create String from const jchar* (aka const unsigned short*).

@saudet
Copy link
Member

saudet commented Dec 16, 2020

NewString() is fine I suppose, but GetStringChars() does not return null terminated strings, so we can't use it in the general case.

@equeim
Copy link
Contributor Author

equeim commented Dec 19, 2020

You are right, I replaced it with GetStringRegion.
By the way, this doesn't work with StringAdapter("char16_t") when passing string to C++ on UNIX platforms yet, because of bug in StringAdapter (it assumes that unsigned short == wchar_t). I plan to rewrite StringAdapter in more generic way to fix this (also, am I correct that C++11 can't be used in generated C++ code?)

@saudet
Copy link
Member

saudet commented Dec 19, 2020

GetStringRegion() is probably not going to be any faster than CallObjectMethod(), but I may be wrong. What is the difference in speed?

By the way, this doesn't work with StringAdapter("char16_t") when passing string to C++ on UNIX platforms yet, because of bug in StringAdapter (it assumes that unsigned short == wchar_t). I plan to rewrite StringAdapter in more generic way to fix this (also, am I correct that C++11 can't be used in generated C++ code?)

I don't see where StringAdapter assumes that wchar_t is 16-bit. Where does that happen?

C++11 shouldn't be used if it's not necessary, yes, but of course there isn't any problem when the classes we want to map are only available in C++11 anyway.

@equeim
Copy link
Contributor Author

equeim commented Dec 19, 2020

Difference in speed comes from not having intermediary CharPointer.
In my Java code, I don't work with CharPointers, I work with Strings. When I pass them to C++, I convert them to CharPointer at the place of calling of native function, and when I pass CharPoiner to Java I call getString() on it immediately because I need java.lang.String, not CharPointer. These conversions add additional overhead.

I did some tests on OpenJDK 11 on Linux, and passing unsigned short* to Java in form of @Utf16String String is about 4x faster than passing it as CharPointer and calling getString() on it.
Passing @Utf16String String to C++ is about 12x faster than creating a new CharPointer from String and passing it to C++.

Of course, these were very synthetic tests, but difference is there.

I don't see where StringAdapter assumes that wchar_t is 16-bit. Where does that happen?

StringAdapter(const unsigned short* ptr, typename std::basic_string<T>::size_type size, void* owner) : ptr((T*)ptr), size(size), owner(owner),
        str2(ptr ? (T*)ptr : L"", ptr ? (size > 0 ? size : wcslen((wchar_t*)ptr)) : 0), str(str2) { }

operator     unsigned   short*() { return (unsigned short*)(operator wchar_t*)(); }

@saudet
Copy link
Member

saudet commented Dec 19, 2020

Interesting, I did a bit of micro benchmarking with a small ASCII string and GetStringRegion() seems to be as fast as GetByteArrayElements(), so it does seem like it's storing strings as UTF-16 internally, even for Java 11 even though it shouldn't be:
https://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16
Maybe the trick that they are using allows for fast copy anyway? Very interesting.

We can cast the data from StringAdapter to int as well, so that should cover pretty much all implementations out there:

            out.println("    operator     unsigned   short*() { return (unsigned short*)(operator wchar_t*)(); }");
            out.println("    operator       signed     int*() { return (  signed   int*)(operator wchar_t*)(); }");

If what you'd like to use are char16_t and char32_t, that's only available from C++11, yes. We can probably add those to StringAdapter with some if __cplusplus >= 201103L || _MSC_VER >= 1900 though. Have you tried?

@saudet
Copy link
Member

saudet commented Jan 22, 2021

Let's try to think of friendlier names for those annotations. Under C11 and C++11, jchar is better known as char16_t, so we may want to have something that casts directly to that instead of jchar, something like @AsChar16T, maybe? However, on second thought, char16_t is a typedef of uint_least16_t that is not guaranteed to be a 16-bit type, which is weird, so it wouldn't make sense to force the association with that I suppose. We may also want to get "modified UTF-8" for performance reasons, see thread about that in issue #70, and there's a char8_t for that in C++20 apparently, but it's not something widely used, yet, so char is sufficient for that for now, but if we name the annotation for that @AsChar8T, it would be majorly confusing. There's also char32_t in C11 and C++11, but there's no efficient method to retrieve code points with JNI, so I guess we don't need to think about that, for now.

Qt, probably the most popular platform in C++, has methods like toUtf8(), utf16(), etc, see https://doc.qt.io/qt-6/qstring.html, so maybe annotations like @AsUtf8 and @AsUtf16 would be clearest after all. We could have those casted to const char* and const char16_t*, respectively, by default. I think it would make sense to assume that char16_t is 16-bit on almost all platforms, similarly to how we're assuming that int is 32-bit on all platforms that Java runs on. That way we'd also be able to use more easily the new annotation from pull #448 this way: @StdU16String @AsUtf16 String myString; and similarly for "modified UTF-8" from JNI: @StdString @AsUtf8 String myString. What do you think?

@saudet
Copy link
Member

saudet commented Jan 22, 2021

Actually, to be consistent with CharPointer, in the case of @AsUtf16, we should cast to unsigned short*. We could compensate for that by making the @StdU16String annotation map to @Adapter("BasicStringAdapter<char16_t, unsigned short>"), which should work fine, and wouldn't force users into C++11.

@equeim
Copy link
Contributor Author

equeim commented Jan 27, 2021

Under C11 and C++11, jchar is better known as char16_t

OpenJDK typedefs jchar as unsigned short on both Linux and Windows, which is distinct type from char16_t. Android NDK typedefs it to uint16_t which itself is usually typedef to unsigned short (except on platforms where unsigned short is not 16 bit, but I don't think Android supports them). Also, char16_t is required to have the same properties as uint_least16_t, but is always a distinct type.
But I agree that @AsUtf16 looks better.

and similarly for "modified UTF-8" from JNI: @stdstring @AsUtf8 String myString

You mean, using @AsUtf8 to convert to/from Modifed UTF-8 instead of normal UTF-8? Seems counterintuitive to me, it definitely should be in the name. Also, is there anyone who actually needs to pass Modified UTF-8 strings to/from native code (instead of UTF-8)? I though that it is mostly used by JVM internally.

We could compensate for that by making the @StdU16String annotation map to @Adapter("BasicStringAdapter<char16_t, unsigned short>"), which should work fine, and wouldn't force users into C++11.

I think you meant BasicStringAdapter<unsigned short, char16_t>, since it's defined as template<typename P, typename T = P> (i.e. first parameter is pointer that adapter converts to/from, and second one is template parameter of basic_string). I'm not sure if we want to hardcode @StdU16String with BasicStringAdapter<unsigned short, char16_t> though, since P depends on pointer type. For example, for String with @AsUtf16 and CharPointer it will be unsigned short but for ShortPointer it will be short. But this is discussion for #448.

@saudet
Copy link
Member

saudet commented Jan 28, 2021

You mean, using @AsUtf8 to convert to/from Modifed UTF-8 instead of normal UTF-8? Seems counterintuitive to me, it definitely should be in the name. Also, is there anyone who actually needs to pass Modified UTF-8 strings to/from native code (instead of UTF-8)? I though that it is mostly used by JVM internally.

Well, "modified" is not in the names of the corresponding JNI functions, so I don't think we need to be too worried about that. You want to have it like @AsModifiedUtf8? Why not I suppose. I don't think anyone needs modified UTF-8, but it's faster than going through Java code. That would be pretty much the only reason why one may want to use that I believe.

I think you meant BasicStringAdapter<unsigned short, char16_t>, since it's defined as template<typename P, typename T = P> (i.e. first parameter is pointer that adapter converts to/from, and second one is template parameter of basic_string). I'm not sure if we want to hardcode @StdU16String with BasicStringAdapter<unsigned short, char16_t> though, since P depends on pointer type. For example, for String with @AsUtf16 and CharPointer it will be unsigned short but for ShortPointer it will be short. But this is discussion for #448.

Yes, BasicStringAdapter<unsigned short, char16_t> is what I meant. It's not hard coded, it's always possible to deal with it using mainly @Cast, for example, native void process(@StdU16String @Cast("unsigned short*") ShortPointer myUtf16Data). It's also possible to do native void process(@Adapter("BasicStringAdapter<signed short, char16_t>") ShortPointer myUtf16Data), among other variants in the same theme.

@@ -937,23 +938,44 @@ boolean classes(boolean handleExceptions, boolean defineAdapters, boolean conver
out.println("}");
out.println();
if (handleExceptions || convertStrings) {
out.println("static JavaCPP_noinline jstring JavaCPP_createString(JNIEnv* env, const char* ptr) {");
out.println("#include <string>");
out.println("static JavaCPP_noinline jstring JavaCPP_createStringUTF8(JNIEnv* env, const char* ptr, size_t length) {");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Methods like String(byte[]) and String.getBytes() actually work for the native charset of the underlying OS, not necessarily UTF-8. We could add custom methods that somehow work with a user-specified encoding, but the default behavior of those methods should stay the same, and they shouldn't be named "UTF8".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I missed that.
I actually started to implement a more generic solution that would allow to pass any Charset to createString/getBytes (so that String could be created from byte array in any encoding), when I thought: how do we determine that byte array returned from getBytes() can be safely null-terminated?. I mean, when we are using default charset, there is no guarantee that returned array won't have null byte somewhere in the middle. Do we just shift responsibility to user?

Copy link
Member

@saudet saudet Feb 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The null character gets added here manually after GetByteArrayRegion(), just like you did with GetStringRegion():
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/tools/Generator.java#L972
As for passing a Charset, I'm thinking something defined globally at compile time as per issue #70 (comment) would be best. I'm not sure how we'd go about passing a Java object rather than just a string like that though. The latter entails potential exceptions at initialization time, but I guess that wouldn't be so bad. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about passing a charset name to getBytes(java.lang.String)/String(byte[], java.lang.String) overloads. Generator would extract charset name from annotation, hardcode it in C++ code which will create jstring from it using NewStringUTF (we can use modified utf-8 here I believe) and pass it to to getString()/String(). It has an advantage over define in that you could use different charsets for different functions.

The downside is that these NewStringUTF() calls will introduce additional overhead (JMH said that it would be almost 50% slower than using String(byte[])) unless you find some way to cache them.

Setting encoding via define would of course also require passing jstring with charset name (unless you meant some other method?) but then you can easily cache it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking something like this too, but the issue with passing a charset name like that for every call is that it may through an exception at runtime, any time. It's better to resolve it at load time into a Charset object.

@saudet saudet merged commit 12a8a2f into bytedeco:master Feb 12, 2021
@saudet
Copy link
Member

saudet commented Mar 9, 2021

BTW, to obtain even higher performance, we should also add support for these functions via @CriticalRegion:
https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#GetStringCritical_ReleaseStringCritical

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants